PhD

The LaTeX sources of my Ph.D. thesis
git clone https://esimon.eu/repos/PhD.git

related works.tex (4907B)


\section{Related Work}
\label{sec:fitb:related work}
The related work on \textsc{nlp} and knowledge bases is presented in Chapter~\ref{chap:context}, and the related work on relation extraction in Chapter~\ref{chap:relation extraction}.
The main approaches we built upon are:
\begin{itemize}
	\item Distant supervision (Section~\ref{sec:relation extraction:distant supervision}, \cite{distant}): the method we use to obtain a supervised dataset for evaluation;%
		\sidenote{As explained in Section~\ref{sec:relation extraction:clustering}, this is sadly standard in the evaluation of clustering approaches.}
	\item \textsc{pcnn} (Section~\ref{sec:relation extraction:pcnn}, \cite{pcnn}): our relation classifier, which was the state-of-the-art supervised relation extraction method at the time;
	\item Rel-\textsc{lda} (Section~\ref{sec:relation extraction:rellda}, \cite{rellda}): the state-of-the-art generative model we compare to;
	\item \textsc{vae} for relation extraction (Section~\ref{sec:relation extraction:vae}, \cite{vae_re}): the overall inspiration for the architecture of our model, with which we share the entity predictor;
	\item Self\textsc{ore} (Section~\ref{sec:relation extraction:selfore}, \cite{selfore}): an extension of our work, which, alongside their own approach, proposed an improvement to our relation classifier by replacing the \textsc{pcnn} with a \bertcoder{}.
\end{itemize}
In this section, we give further details about the relationship between our losses and the ones derived by \textcite{vae_re}.
As a reminder, their model is a \textsc{vae} defined from an encoder \(Q(r\mid \vctr{e}, s; \vctr{\phi})\) and a decoder \(P(\vctr{e}\mid r, s; \vctr{\theta})\) as:
\begin{marginparagraph}
	The prior of a conditional \textsc{vae} \(P(r\mid\vctr{\theta})\) is usually conditioned on \(s\) too.
	However, \textcite{vae_re} do not use this additional conditioning.
\end{marginparagraph}
\begin{equation}
	\loss{vae}(\vctr{\theta}, \vctr{\phi}) = \expectation_{Q(r\mid \vctr{e}, s; \vctr{\phi})}[ - \log P(\vctr{e}\mid r, s; \vctr{\theta})] + \beta \kl(Q(r\mid \vctr{e}, s; \vctr{\phi}) \mathrel{\|} P(r\mid\vctr{\theta}))
	\label{eq:fitb:vae full loss}
\end{equation}
This is simply a rewriting of the \textsc{elbo} of Equation~\ref{eq:relation extraction:elbo}, substituting the relation extraction variables for the generic ones.
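For reference, for a generic observed variable \(x\) and latent variable \(z\), the negative \textsc{elbo} of a standard \textsc{vae} can be written as:
\begin{equation*}
	\expectation_{q(z\mid x; \vctr{\phi})}[-\log p(x\mid z; \vctr{\theta})] + \kl(q(z\mid x; \vctr{\phi}) \mathrel{\|} p(z; \vctr{\theta})).
\end{equation*}
Equation~\ref{eq:fitb:vae full loss} follows the same pattern, with \(\vctr{e}\) in place of \(x\) and \(r\) in place of \(z\).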
There are, however, two differences compared to a standard \textsc{vae}.
First, the variable \(s\) is not reconstructed; it simply conditions the whole process.
Second, the regularization term is weighted by a hyperparameter \(\beta\).
This makes the model of \textcite{vae_re} a conditional \(\beta\)\textsc{-vae} \parencitex{conditional_vae, beta_vae}[-11mm].
The first summand of Equation~\ref{eq:fitb:vae full loss} is called the reconstruction loss since it reconstructs the input variable \(\vctr{e}\) from the latent variable \(r\) and the conditional variable \(s\).
Since we followed the same structure for our model, this reconstruction loss is actually \loss{ep}, the difference lying in the relation classifier.
We can then rewrite the loss of \textcite{vae_re} as:
\begin{marginparagraph}
	As explained in Section~\ref{sec:relation extraction:vae}, \(Q\) is the \textsc{vae}'s encoder.
\end{marginparagraph}
\begin{align*}
	\loss{vae}(\vctr{\theta}, \vctr{\phi}) & = \loss{ep}(\vctr{\theta}, \vctr{\phi}) + \beta \loss{vae reg}(\vctr{\theta}, \vctr{\phi}) \\
	\loss{vae reg}(\vctr{\theta}, \vctr{\phi}) & = \kl(Q(\rndm{r}\mid \rndmvctr{e}, \rndm{s}; \vctr{\phi}) \mathrel{\|} P(\rndm{r}\mid\vctr{\theta}))
\end{align*}
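In this rewriting, \loss{ep} is simply the first summand of Equation~\ref{eq:fitb:vae full loss}, i.e.~the reconstruction term evaluated with their relation classifier \(Q\):
\begin{equation*}
	\loss{ep}(\vctr{\theta}, \vctr{\phi}) = \expectation_{Q(r\mid \vctr{e}, s; \vctr{\phi})}[ - \log P(\vctr{e}\mid r, s; \vctr{\theta})]
\end{equation*}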
In their work, they take the prior to be the uniform distribution over all relations \(P(\rndm{r}\mid\vctr{\theta}) = \uniformDistribution(\relationSet)\) and approximate \loss{vae reg} as follows:
\begin{equation*}
	\loss{vae reg}(\vctr{\phi}) = \expectation_{(\rndm{s}, \rndmvctr{e})\sim \uniformDistribution(\dataSet)} \left[ - \entropy(\rndm{R} \mid \rndm{s}, \rndmvctr{e}; \vctr{\phi}) \right]
\end{equation*}
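Up to an additive constant, this expression follows from the choice of prior: the uniform distribution has no trainable parameter, so \(\vctr{\theta}\) disappears from the regularizer, and for a given sample \((s, \vctr{e})\) the \textsc{kl} divergence expands as:
\begin{align*}
	\kl(Q(r\mid \vctr{e}, s; \vctr{\phi}) \mathrel{\|} \uniformDistribution(\relationSet))
		& = \sum_{r\in\relationSet} Q(r\mid \vctr{e}, s; \vctr{\phi}) \log\left(|\relationSet|\, Q(r\mid \vctr{e}, s; \vctr{\phi})\right) \\
		& = - \entropy(\rndm{R} \mid \rndm{s}, \rndmvctr{e}; \vctr{\phi}) + \log |\relationSet|
\end{align*}
Averaging over the dataset and dropping the constant \(\log |\relationSet|\), which does not affect the minimization, yields the expression above.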
The purpose of \loss{vae reg} is to prevent the classifier from always predicting the same relation, i.e.~it plays the same role as our distance loss \loss{d}.
However, its expression is equivalent to \(-\loss{s}\): indeed, minimizing the opposite of our skewness loss increases the entropy of the classifier output, addressing \problem{2} (the classifier always outputting the same relation).
Yet, since the per-sample entropy is maximal when the classifier output is uniform, using \(\loss{vae reg}=-\loss{s}\) alone draws the classifier into the other pitfall, \problem{1} (not predicting any relation confidently).
In a traditional \textsc{vae}, \problem{1} is addressed by the reconstruction loss \loss{ep}.
However, at the beginning of training, the supervision signal is so weak that we cannot rely on \loss{ep} for our task.
The \(\beta\) weighting can be decreased to avoid \problem{1}, but this would also weaken the solution to \problem{2}.
This causes a drop in performance, as we show experimentally.